Abstract

We propose light-weight lease primitives to leverage fault-tolerant coordination 
among clients accessing a shared storage infrastructure (such as network attached disks 
or storage servers). In our approach, leases are implemented from the very shared data 
that they protect. That is, there is no global lease manager, there is a lease per data 
item (e.g., a file, a directory, a disk partition, etc.) or a collection thereof. Our lease 
primitives are useful for facilitating exclusive access to data in systems satisfying cer- 
tain timeliness constraints. In addition, they can be utilized as a building block for 
implementing dependable services resilient to timing failures. In particular, we show 
a simple lease based solution for fault-tolerant Consensus which is a benchmark dis- 
tributed coordination problem. 
1 Introduction

Motivation. In recent years, advances in hardware technology have made possible a new 
approach for storage sharing, in which clients access disks directly over a storage area net- 
work (SAN). By allowing the data to be transferred directly from network attached disks to 
clients, SAN has the potential to improve scalability (through eliminating the file server bot- 
tleneck) and performance (through shorter data paths). However, without properly restrict- 
ing concurrent access to shared data by clients, shared data would be rendered inconsistent. 
Therefore, a scalable and efficient locking support is widely recognized as a key requisite for 
realizing the SAN technology's full potential. 

The traditional approach to implementing locks in SAN-based file systems designates 
a lock manager to administer shared access [11, 32], thus creating a performance and an 
availability bottleneck. An alternative approach, put forth in this paper, is to employ
storage-centric locking, i.e., to co-locate locks with the very data items that are protected by
these locks. This way, the cost of locking is folded into the cost of accessing the data itself,
and the locks' availability is the same as that of the data itself. The challenge is in providing
an efficient and fault-tolerant implementation. 

Fault tolerance. A naive per-datum lock design would associate a strong object that di- 
rectly implements locking (such as test-and-set) with each data item. However, this approach 
has several drawbacks: First, it necessitates sophisticated support on the part of the storage
hardware such as SCSI controllers enhanced with device locks (see [35, 9]), or object store 
controllers (see [36, 22]). These hardware enhancements still remain proprietary and it is 
unclear whether they will be accepted by the storage manufacturers in the future. 

Second, data is frequently replicated on several storage units (e.g., a file may be striped, 
or mirrored) for availability and fault-tolerance. As a result, it is desirable to have the locks 
replicated as well so that the same level of availability is preserved. Unfortunately, as it 
was proved in [24], it is impossible to use a collection of fail-prone strong objects (such as 
test-and-set, compare-and-swap, etc.) to implement a reliable one. 

We therefore opt for an alternative approach which is to build locks from weaker objects, 
i.e., read/write registers. Thus, deployment becomes a non-issue, as designating a read/write 
word per file or per block on a disk is trivially done. If multi-disk locking is required,
a single reliable read/write register is implementable using a farm of failure-prone storage 
units (see, e.g., [7, 8, 13]). In the remainder of this paper, we largely ignore replication 
and follow a modular approach: i.e., we will assume that reliable registers are available, and 
develop algorithms in a shared memory model with reliable registers. 

Uniform solutions. It is known that supporting mutual exclusion with read/write regis- 
ters incurs a cost that is linear in the maximum potential number of participating processes, 
in terms of both the memory consumption and the number of shared memory accesses [12]. 
Indeed, many similar abstractions such as failure detectors, or the $\Omega$ leader oracle of [16], are
defined for a group of known members. To circumvent this limitation, we adopt a timing-
based locking approach that was originally suggested by Fischer [26]. This results in a very
simple locking protocol that uses a single read/write register per data item to support
exclusion among an a priori unknown (but eventually finite) number of client processes.

We enhance Fischer's scheme with a number of important modifications. First, in order to 
support automatic recovery of the locks held by failed processes, we augment the scheme with 
an expiration mechanism so that a lock is leased to a process for a pre-defined time period. 
Once the lease period expires, the lock is relinquished and subsequently, can be granted 
to another process. (In the following, we use the terms locks and leases interchangeably.)
Another important extension we present is the support for automatic lease renewal. This 
leads to efficient utilization of the lease by a leader who holds the lease and continues doing 
useful work. 

Reaching coordination. There still remain tasks that are best handled by a coordinated 
group of SAN managers. For example, SAN managers need to administer volume assign- 
ments and configuration information. The common approach for reaching consensus among 
multiple servers in such tasks is to employ the Paxos paradigm [27]. This paradigm preserves 
uniqueness of decisions through a three phase commit protocol, and relies on timeliness con- 
ditions for progress. Our leases serve as a fundamental enabler of the Paxos paradigm 
in storage-centric systems, and a necessary building block for the agreement algorithms in 
[19, 14]. Our leases guarantee exclusion to clients once the system stabilizes (and remains 
stable for long enough), regardless of any past timing violations. This allows our lease to 
support an eventual leader-election primitive, a necessary building block for implementing 
dependable services resilient to timing failures. 

We show a simple lease based solution for fault-tolerant consensus that guarantees agree- 
ment at all times but can fail to make progress when the system is unstable. The latter can 
be used to realize efficient, always safe fault-tolerant locking using a hierarchical approach 
described by Lampson in [28]. 

Contribution. In the remainder of this exposition, we provide a formal treatment of the 
problem at hand, in which disks are simply considered to be persistent shared memory 
containers accessed by multiple fail-prone clients. Our work provides the following formal 
contribution. It gives a specification of leases, including a renewal operation. It provides an 
efficient way to implement leases for an unbounded number of unreliable client processes. 
The solution applies ideas originally developed for mutual exclusion in synchronous shared 
memory to derive light-weight lease primitives for highly decentralized and unreliable dis- 
tributed settings. Finally, we show a simple lease based solution for fault-tolerant Consensus 
which is a benchmark distributed coordination problem. 

2 Related Work 

In this paper we apply the real-time mutual exclusion theory to support locking in practical 
SAN-based systems. In the following, we survey the current state-of-the-art in these two 
areas. 



2.1 Locking Support in SAN-based file systems 

Traditionally, SAN-based file systems rely on separate servers to maintain their meta-data 
and coordinate access to the user data on storage devices. The meta-data servers can be repli- 
cated for better availability and load balancing. The server replicas are kept in a consistent 
state using a group-communication substrate. However, the cluster of replicated meta-data 
servers still remains the performance and availability hotspot as all the file-system operations 
(even those targeted to different objects) must consult the meta-data servers before accessing 
the storage. Examples of the systems whose design follows this approach include the IBM 
General Parallel File System (GPFS) [37] and IBM StorageTank [32]. More examples can 
be found in [22]. 

The vision of storage-centric locking was first realized in the Global File System (GFS)
project [34, 38, 39] developed at the University of Minnesota. In GFS, the cluster nodes
physically share storage devices connected via a high-speed network. GFS utilizes fine grain 
test-and-set locks provided by specialized SCSI devices [35, 9] to implement atomic execution 
of file system operations. 

Amiri et al. [6] propose base storage transactions (BSTs) as a core paradigm for main-
taining low-level integrity of striped storage (such as RAID) in the face of concurrent client
accesses. In particular, the paper discusses device-served locking as an alternative to tradi-
tional centralized locking schemes. It demonstrates through an extensive empirical perfor-
mance study that device-served locking provides better performance under high contention,
and is therefore more scalable.

zFS [36, 22] is a research file system implemented over object store devices [33] directly 
accessible over a SAN. In zFS, each storage device maintains a coarse grain lock which can 
be used by a lease manager to obtain an exclusive access (a major lease) to the entire device. 
The lease manager is then responsible for administering fine grain locks to clients requesting 
access to individual data items stored on the device. 

The symmetrical locking mechanisms above all guarantee availability of lock information
in the face of process failures. However, none of these systems supports data and lock replication
and therefore none guarantees availability in the face of storage device failures. As a partial
solution, reliability hardware (such as RAID) may be employed in these systems to mask
the storage failures to some extent. In addition, both GFS and zFS require sophisticated
storage hardware which must be able to support read-modify-write instructions and, in the
case of zFS, also be capable of measuring the passage of real time.

2.2 Time Based Mutual Exclusion 

Algorithms for mutual exclusion in the presence of failures must be based on timeliness 
assumptions, as they have to be able to attain progress in spite of process failures while 
executing in their critical section. There are two commonly used timing assumptions in this 
context: The known delay model of [3, 4, 5] and the unknown delay model of [2]. 

The known delay model was first formally defined in [4]. The first mutual exclusion 
algorithm explicitly based on the known delay assumption was the famous Fischer algorithm, 
which was first mentioned by Lamport in [26]. In [26], another timing based algorithm is
presented. This algorithm assumes a known upper bound on the time a process may spend in
the critical section.

Alur et al. consider in [2] the unknown delay model: The time it takes for a process to 
make a step is bounded but unknown to the processes. The paper presents algorithms for 
mutual exclusion and Consensus in this model. A remarkable feature of these algorithms 
is their ability to preserve safety even in completely asynchronous runs. However, they are 
guaranteed to satisfy progress only if the system behaves synchronously throughout the entire 
run. The mutual exclusion algorithm of [31] combines the ideas of Fischer and Lamport's 
fast mutual exclusion algorithm [26] to derive a timing based algorithm that guarantees 
progress when the system stabilizes while being safe at all times. However, the algorithm 
of [31] is not fault-tolerant. 

As far as we know the eventual known delay timed (OND) model introduced in this 
paper was never considered in the shared memory context. Most of the existing time based 
algorithms are either not fault-tolerant [4, 5], or resilient only to the timing failures [31, 2]. 
The fault-tolerant (wait-free) timing based algorithms of [3] are not suitable for the OND 
model as they might violate safety and/or liveness even during synchronous periods if the 
delay constraints do not hold right from the beginning of the run. 

The OND model considered in this paper is an extension of a standard asynchronous 
shared memory model to include timeliness assumptions based on the absolute real-time. 
To this end, the OND model postulates the existence of bounded drift local hardware clocks 
accessible to each process. In this respect, the OND model closely resembles the timed 
asynchronous model of Cristian and Fetzer defined in [17]. An alternative approach to 
model timeliness in shared memory environments is to postulate the existence of a known 
upper bound on relative process speeds as it is done by Lynch and Shavit in [31]. This 
results in a model analogous to the partial synchrony model of [18]. However, as is, the 
partial synchrony model of [31] is inappropriate for our purposes as it does not distinguish 
between local process steps and those involving a shared memory access. This distinction 
is important if non-atomic shared objects (such as regular registers) are assumed. Relaxing 
the partial synchrony model of [31] to allow non-atomic memory access as well as evaluating 
applicability of other timed models (e.g., [1], or the timed I/O automata model of [25]) 
remains a subject of the future work. 

Other properties that are of interest to us are the ability of timing based algorithms to
support exclusion among arbitrarily many client processes and to work with weaker regis-
ters and/or a small number thereof. The latter is particularly important in failure prone
environments as in these environments the registers must be first emulated out of possibly
faulty components. In this respect the original solution by Fischer is superior to all the other
algorithms as it is based on a single multi-writer multi-reader register. In fact, as we show
in this paper, the register is only required to support regular semantics (in the sense of [13]),
and hence may be emulated efficiently even in a message passing setting. This solution was
therefore chosen as a basis for our lease implementation. The algorithms of [31] and [4] are
also oblivious to the number of participants and use two and three shared atomic registers
respectively.

The quality of timing based mutual exclusion algorithms is frequently assessed in
terms of their performance in contention free runs. In particular, a good algorithm is expected
to avoid delay statements when there is no contention. The performance of the timing based
algorithms under various levels of contention is analyzed in [20]. The paper examines (both
analytically and in simulations) the expected throughput of timing based mutual exclusion
algorithms under various statistical assumptions on the arrival rate and the service time.
The question of further optimizing our lease approach for contention free runs is left for
future research.

2.3 Other work on locks and leases 

Gray and Cheriton were the first to employ leases in [23] for constructing fault-tolerant 
distributed systems. Lampson advocates in [28, 29] the use of leases to improve the Paxos 
algorithm. Boichat et al. [10] introduce asynchronous leases as an optimization to the atomic 
broadcast algorithms based on the rotating coordinator paradigm. Chockler et al. [15] show a 
randomized backoff based algorithm for implementing leases in a setting similar to the OND 
model of this paper. However, the algorithm of [15] guarantees progress only probabilistically, 
and relies on shared objects that can measure the passage of time. Finally, Cristian and 
Fetzer [17] show an implementation of leases in timed asynchronous message passing systems. 

3 System Model 

We will start by defining a basic asynchronous shared memory model and the regular register 
properties (Section 3.1). We will follow the basic formalism of [13]. Then, in Section 3.3, 
we augment the basic model with necessary timeliness assumptions by adapting the timed 
asynchronous model of [17] to the shared memory environment. 

3.1 The Basic Model 

Our basic model is an asynchronous shared memory model consisting of a finite but a priori
unknown universe of processes $p_1, p_2, \ldots$ communicating by means of a finite collection of
shared objects, $O_1, \ldots, O_n$. Every shared object has a sequential specification defining the
object behavior when accessed sequentially. A sequence of operations on a shared object
is legal if it belongs to the sequential specification of the shared object. In this paper, we
restrict our attention to read/write shared objects. A sequence of operations on a read/write
shared object is legal if each read operation returns the value written by the most recent
write operation if such exists, or an initial value otherwise.

The operations on objects have non-zero duration, commencing with an invocation and
ending with a response. An execution of an object is a sequence of possibly interleaving
invocations and responses. For an execution $\sigma$ and a process $p_i$, we denote by $\sigma|_i$ the
subsequence of $\sigma$ containing invocations and responses performed by $p_i$. Processes may fail
by crashing. A process is called correct in an execution $\sigma$ if it never crashes throughout
$\sigma$. Otherwise, a process is called faulty in $\sigma$. A threshold $t$ of the objects may suffer non-
responsive crash failures [24], i.e., may stop responding to incoming invocations.

An execution $\sigma$ is admissible if the following is satisfied: (1) Every invocation by a correct
process in $\sigma$ has a matching response; and (2) For each process $p_i$, $\sigma|_i$ consists of alternating
invocations and matching responses beginning with an invocation. In the rest of this paper,
only admissible executions will be considered.

Given an execution $\sigma$, we denote by $\mathit{ops}(\sigma)$ (resp. $\mathit{writes}(\sigma)$) the set of all operations
(resp. all write operations) in $\sigma$; and for a read operation $r$ in $\sigma$, we denote by $\mathit{writes}_{\leftarrow r}$ the
set of all write operations $w$ in $\sigma$ such that $w$ begins before $r$ ends in $\sigma$. The operations in
$\mathit{ops}(\sigma)$ are partially ordered by a $\rightarrow_\sigma$ relation satisfying $o_1 \rightarrow_\sigma o_2$ iff $o_1$ ends before $o_2$ begins
in $\sigma$. In the following, we will often omit the execution subscript from $\rightarrow$ if it is clear from
the context.

Our definition of regularity for a multi-reader/multi-writer read/write shared object is 
similar to the MWR2 condition of [13]. It is as follows: 

Definition 1 (Regularity). An execution $\sigma$ satisfies regularity if there exists a permutation
$\pi$ of all the operations in $\mathit{ops}(\sigma)$ such that for any read operation $r$, the projection $\pi_r$ of $\pi$
onto $\mathit{writes}_{\leftarrow r} \cup \{r\}$ satisfies:

1. $\pi_r$ is a legal sequence.

2. $\pi_r$ is consistent with the $\rightarrow$ relation on $\mathit{ops}(\sigma)$.

A read/write shared object is regular if all its executions satisfy regularity. 
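
For very small executions, Definition 1 can be checked by brute force. The following is a minimal sketch, not part of the original formalism: the `Op` record, the function names, and the representation of an operation as a time interval carrying the value it wrote or returned are all assumptions made for illustration. It enumerates every permutation and tests the two conditions of the definition for each read.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str               # "read" or "write"
    value: Optional[str]    # value written, or value returned by a read
    start: float            # invocation time
    end: float              # response time


def precedes(a: Op, b: Op) -> bool:
    """The precedence relation: a ends before b begins."""
    return a.end < b.start


def _projection_ok(pi, r: Op, initial) -> bool:
    # Project the permutation onto the writes that begin before r ends, plus r.
    proj = [o for o in pi if o is r or (o.kind == "write" and o.start < r.end)]
    # Condition 2: the projection must respect precedence.
    for i, later in enumerate(proj):
        for earlier in proj[:i]:
            if precedes(later, earlier):
                return False
    # Condition 1: r must return the most recent write before it (or initial).
    last = initial
    for o in proj:
        if o is r:
            return r.value == last
        last = o.value
    return False


def is_regular(ops: list[Op], initial: Optional[str] = None) -> bool:
    """True if some permutation of ops satisfies both conditions for every read."""
    reads = [o for o in ops if o.kind == "read"]
    return any(all(_projection_ok(pi, r, initial) for r in reads)
               for pi in permutations(ops))
```

The search is factorial in the number of operations, so this is only useful as an executable reading of the definition on hand-built examples, such as a read that is concurrent with a write and may return either the old or the new value.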

3.2 Masking object failures 

Given a collection of $n > 2t$ shared objects, up to $t$ of which can suffer from non-responsive
crash failures, it is possible to construct a wait-free regular register as defined in the previous
section (see e.g., [13, 8]). The resulting reliable registers can then be used to construct higher
level services. Hence, in this paper we will follow a modular approach: i.e., we will assume 
that reliable registers are available, and develop algorithms in a shared memory model with 
reliable registers. 
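
One way to see why the bound $n > 2t$ matters: a client that can wait for at most $n - t$ responses needs any two such response sets to share an object, so that a read meets the latest completed write. The sketch below checks that intersection property exhaustively for small $n$. It illustrates the counting argument only and is not the construction of [13, 8]; the function name is hypothetical.

```python
from itertools import combinations


def quorums_intersect(n: int, t: int) -> bool:
    """True if every two sets of n - t objects (out of n) share a member."""
    q = n - t
    objects = range(n)
    return all(set(a) & set(b)
               for a in combinations(objects, q)
               for b in combinations(objects, q))
```

With `n = 3, t = 1` every pair of two-object sets overlaps, whereas with `n = 4, t = 2` the sets `{0, 1}` and `{2, 3}` are disjoint.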

3.3 The Augmented model 

In the augmented model, each process is assumed to have access to a hardware clock with 
some predetermined granularity. We also assume that each process can suspend itself by 
executing a delay statement. Thus, a call to delay(t) will cause the caller to suspend its 
execution for t consecutive time units. We model the system behavior as a General Timed 
Automaton (GTA) [30], which is a state machine augmented with special time-passage events
$\nu(t)$, $t \in \mathbb{R}^+$. The time-passage event $\nu(t)$ denotes the passage of real time by the amount $t$.

The system is called stable over a time interval $[s, t]$, called a stability period, if the
following holds during $[s, t]$: (1) The processes' clock drift with respect to the real-time is
bounded by a known constant $\rho$. For simplicity we assume that $\rho = 0$ (it is easy to extend
our results to clocks with $\rho \neq 0$); and (2) The time it takes for a correct process to complete
its access to a shared memory object, i.e., to invoke an operation and receive a reply, is
strictly less than a known bound $\delta$.

In the following, we will be interested mainly in properties exhibited by the system during
stability periods. To simplify the presentation, we will consider a timed model, which we
call an Eventually Known Delay Timed model, or OND, with stability periods of infinite
duration: i.e., we assume that for each run there exists a global stabilization time (GST)
such that the system is stable forever after GST (i.e., during $[GST, \infty)$). In the remainder
of the presentation, all properties and correctness proofs regard operations that start after
GST.

Figure 1: Well-formed interaction of process $i$ and the $\Delta$-Lease object

We will also consider a special case of the OND model, which we call a Known Delay 
Timed model, or ND, that requires each run to be stable right from the outset. 



4 The Lease Specification without Renewals 

We define the $\Delta$-Lease object as a shared memory object that can be concurrently accessed
by any number of processes, and whose interface consists of the following two operations for
each process $i$: contend$_i$ and release$_i$. The responses to these operations are ack$_i$. We
assume that the interaction between each process $i$ and the lease object is well-formed in the
sense that it is consistent with the state diagram depicted in Figure 1.

A process that is not holding a lease is in the state Free. We assume that each process 
execution always starts from the Free state. A process that attempts to acquire the lease, 
invokes contend and moves to the state Try. Once contend returns, the process moves 
to the state Hold, assuming the lease for the next $\Delta$ time units. Once the lease expires, the
process moves to the Exit state. In this state, the application invokes release and returns
to the Free state upon the ack response. 



In the states Free, Hold and Exit, the process executes the code specified by the applica-
tion program. We do not put any restrictions on the time spent in the Free state (indicated
by the $t > 0$ time passage). However, we assume that the transition from the state Exit to the
Release state is instantaneous (indicated by a $0$ time passage).
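
The well-formedness condition can be read as a small state machine. Since Figure 1 itself is not reproduced here, the sketch below encodes only the transitions spelled out in the text. The event names (`"contend"`, `"ack"`, `"expire"`, `"release"`) and the helper `is_well_formed` are assumptions made for illustration, and the figure may contain detail that this table omits.

```python
from enum import Enum


class State(Enum):
    FREE = "Free"
    TRY = "Try"
    HOLD = "Hold"
    EXIT = "Exit"
    RELEASE = "Release"


# (current state, event) -> next state, as described in the prose.
TRANSITIONS = {
    (State.FREE, "contend"): State.TRY,
    (State.TRY, "ack"): State.HOLD,
    (State.HOLD, "expire"): State.EXIT,       # the lease period has elapsed
    (State.EXIT, "release"): State.RELEASE,   # assumed instantaneous
    (State.RELEASE, "ack"): State.FREE,
}


def is_well_formed(events, start=State.FREE):
    """True if the event sequence follows the transition table from `start`."""
    state = start
    for event in events:
        nxt = TRANSITIONS.get((state, event))
        if nxt is None:
            return False
        state = nxt
    return True
```

For instance, the sequence contend, ack, expire, release, ack is accepted, while invoking release from the Free state is rejected.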

A $\Delta$-Lease object is required to satisfy the following property after time $t >$ GST:

Property 1. At any point in an execution, the following holds: 

1. Safety: At most one process is in the Hold state. 

2. Contend Progress: If no process is in the Hold state, and some correct process is in the 
Try state, then at some later point some correct process enters the Hold state. 

3. Release Progress: At any point in an execution, if a correct process i is in the Release 
state, then at some later point process i enters the Free state. 
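
The Safety clause can be tested against a recorded trace. The sketch below assumes a hypothetical trace format, a list of `(process_id, enter_time, leave_time)` triples describing when each process was in the Hold state, and checks that no two such intervals overlap. It is meant as a testing aid under that assumption, not as part of the specification.

```python
def at_most_one_holder(hold_intervals):
    """hold_intervals: iterable of (process_id, enter_time, leave_time).

    Returns True if no two Hold intervals overlap. After sorting by entry
    time it is enough to compare each interval with its successor.
    """
    ordered = sorted(hold_intervals, key=lambda h: h[1])
    for (_, _, leave), (_, enter, _) in zip(ordered, ordered[1:]):
        if enter < leave:
            return False
    return True
```

Intervals that merely touch (one process leaves Hold at the instant another enters) are treated as non-overlapping in this sketch.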



5 The Lease Implementation 

The $\Delta$-Lease object implementation appears in Figure 2. It utilizes a single shared multi-
reader multi-writer regular register $x$. A process that tries to acquire the lease writes a
unique timestamp to the register $x$ and delays for $2\delta$ time units. If upon the delay expiration,
the process reads its own value back, then it acquires the lease and enters the Hold state.
Otherwise, it backs off to the loop in lines 4-8, where it waits until the current lease holder
either relinquishes the lease, or the lease period $\Delta$ expires without release being called.
The latter could happen if the current lease holder crashes before calling release. Note
that each process has to write a unique timestamp (e.g., id and a sequence number) into $x$.
This is necessary in order to prevent a process that acquires the lease several times in a
row from being falsely suspected by other processes.

Upon release, a special $\bot$ value is written to $x$ to indicate the fact that no process
is currently holding the lease. This way a newly contending process could avoid the delay
statement in line 2.6 and proceed directly to 2.9.
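
The protocol described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `Register` is an in-process stand-in for the shared regular register (in a SAN it would be a word or block on disk), `BOTTOM` plays the role of the special "no holder" value, the timestamp is a `(pid, sequence number)` pair, and the `delay` hook is a hypothetical parameter so that tests can replace real sleeping. Comments give the corresponding line numbers of Figure 2.

```python
import itertools
import time

BOTTOM = None  # stand-in for the special "no holder" value


class Register:
    """In-memory stand-in for the single shared read/write register x."""

    def __init__(self):
        self.value = BOTTOM

    def read(self):
        return self.value

    def write(self, v):
        self.value = v


class Lease:
    def __init__(self, x, pid, lease_period, delta, delay=time.sleep):
        self.x, self.pid = x, pid
        self.lease_period = lease_period   # the lease duration
        self.delta = delta                 # assumed bound on one register access
        self.delay = delay
        self._seq = itertools.count()

    def contend(self):
        x2 = self.x.read()                                    # line 1
        while True:                                           # line 2
            if x2 is not BOTTOM:                              # line 3
                while True:                                   # line 4
                    x1 = x2                                   # line 5
                    self.delay(self.lease_period + 5 * self.delta)  # line 6
                    x2 = self.x.read()                        # line 7
                    if x1 == x2 or x2 is BOTTOM:              # line 8
                        break
            ts = (self.pid, next(self._seq))                  # line 9
            self.x.write(ts)                                  # line 10
            self.delay(2 * self.delta)                        # line 11
            x2 = self.x.read()                                # line 12
            if x2 == ts:                                      # line 13
                return "ack"                                  # line 14

    def release(self):
        self.x.write(BOTTOM)
        return "ack"
```

A caller would hold the lease for at most `lease_period` time units after `contend()` returns and then call `release()`. The exclusion argument relies on register accesses completing within `delta`, which this in-memory stand-in does not enforce or measure.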

We now prove that the implementation in Figure 2 satisfies the $\Delta$-Lease object properties.

Throughout the proof, we make use of the following assumptions and notations. Let
$L$ be a contend operation. We denote the sequence of read/write operations by which $L$
terminates by:

$L.r'$, (delay $\Delta + 5\delta$), $L.r''$, $L.w$, (delay $2\delta$), $L.r$.

That is, denote by $L.w$ the last write operation invoked during $L$ (i.e., the last time line 10
in Figure 2 is activated). Denote by $L.r$ the read operation that follows $L.w$ (on line 12).
If there exists a read operation invoked from line 7, denote by $L.r''$ the one immediately
preceding $L.w$. If $L.r''$ exists, it is immediately preceded by a read operation $L.r'$ from line
1 or line 12 followed by a delay of $(\Delta + 5\delta)$. Otherwise, let $L.r'$ be the last read operation
during $L$ from line 1 or line 12 that precedes $L.w$.

Finally, for the execution considered in all proofs, let $\pi$ be a serialization of the operations
that upholds the regularity of $x$.
Shared:  $x \in TS_\bot$;
Local:   $x_1, x_2 \in TS_\bot$.

CONTEND:
(1)   $x_2 \leftarrow$ read($x$);
(2)   do
(3)     if ($x_2 \neq \bot$) then {
(4)       do
(5)         $x_1 \leftarrow x_2$;
(6)         delay($\Delta + 5\delta$);    /* $\Delta + 6\delta$ for the OND renewals */
(7)         $x_2 \leftarrow$ read($x$);
(8)       until $x_1 = x_2 \lor x_2 = \bot$;
        }
(9)     Generate a unique timestamp $ts$;
(10)    write($x$, $ts$);
(11)    delay($2\delta$);
(12)    $x_2 \leftarrow$ read($x$);
(13)  until $x_2 = ts$;
(14)  return ack;

RELEASE:
      write($x$, $\bot$);
      return ack;

Figure 2: The $\Delta$-Lease Implementation.

Lemma 1. Let $L_0$ be a contend operation invoked by process $p$ that returns at time $t_0$.
Denote by $s_0 = t_0 + \Delta$ the expiration time of $L_0$. Then for all contend operations $L$ such that
$L.w$ appears in $\pi$ after $L_0.w$, if $L.r''$ is invoked, then it is invoked after $s_0 + \delta$.

Proof. Assume to the contrary, and let $L$ be a contend operation such that $L.w$ is the first
write in $\pi$ that breaks the conditions of the lemma.

Clearly, $L.w$ does not precede $L_0.r$ in $\pi_{L_0.r}$, for else $L_0.r$ cannot return the value written
by $L_0.w$. Furthermore, since all write operations $w$ such that $w \rightarrow L_0.r$ must appear in $\pi_{L_0.r}$
before $L_0.r$, and because by assumption $L_0.w$ precedes $L.w$ in $\pi$, $L.w \not\rightarrow L_0.r$. Putting this
together with the fact that the response of $L_0.w$ and the start of $L_0.r$ are separated by a $2\delta$
delay, we have $L_0.w \rightarrow L.r''$ (see Figure 3(a)). Hence, $L_0.w \in \pi_{L.r''}$.

Next, we show that $L_0.w$ is the last write preceding $L.r''$ in $\pi_{L.r''}$. Let $L' \neq L$ be a
contend operation such that $L'.w$ is between $L_0.w$ and $L.r''$ in $\pi_{L.r''}$. By assumption, $L'.r''$
must be invoked after $s_0 + \delta$. Since, by definition of $\pi_{L.r''}$, $L'.w$ must be invoked before $L.r''$
returns, $L.r''$ returns after $s_0 + \delta$, as depicted in Figure 3(b). Since $L'.w$ is invoked after
$s_0 + \delta$, and since by assumption, $L.r'$ finishes before $s_0 + \delta$, we get that $L.r' \rightarrow L'.w$. Putting
this together with the assumption that $L'.w$ precedes $L.r''$ in $\pi_{L.r''}$, we obtain that $L.r'$ and
$L.r''$ will return different values in which case the lease implementation implies that the write
statement is not reached. Hence, $L.w$ could not have been invoked. Thus, $L_0.w$ is the last
write preceding $L.r''$ in $\pi_{L.r''}$ implying that $L.r''$ returns the value written by $L_0.w$.

Figure 3: Possible placements of overlapping contend operations $L_0$ and $L$.

By construction, $L.r''$ is preceded by a $5\delta + \Delta$ delay preceded by another read operation
$L.r'$ such that the timestamp values returned by these two reads are identical. However, it
is easy to see that $L_0.w$ is contained in full between these two reads. Indeed, we already
know that $L_0.w \rightarrow L.r''$. We now show that $L.r' \rightarrow L_0.w$. Indeed, the earliest time that
$L_0.w$ can be invoked is $s_0 - \Delta - 4\delta$. Since by assumption $L.r''$ is invoked before $s_0 + \delta$, $L.r'$
returns before $s_0 + \delta - (\Delta + 5\delta) = s_0 - \Delta - 4\delta$ (see Figure 3(b)). Therefore, $L.r' \rightarrow L_0.w$.
Thus, regularity of $x$ and the timestamp uniqueness imply that $L.r'$ and $L.r''$ return different
timestamps in which case the lease implementation implies that the write statement is not
reached. Hence, $L.w$ could not have been invoked. A contradiction. □
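
The two time bounds used in the last step of the proof can be spelled out as follows. This is a sketch of the arithmetic only. The notation $\mathrm{inv}(o)$ and $\mathrm{resp}(o)$ for the invocation and response times of an operation $o$ is introduced here for illustration and does not appear in the original text. It also assumes, as in the model, that $\rho = 0$, that each register access takes less than $\delta$, and that local steps take no time.

```latex
% Assumed notation (not in the original): inv(o), resp(o) are the
% invocation and response times of operation o.
\begin{align*}
t_0 - \mathrm{inv}(L_0.w)
  &< \underbrace{\delta}_{L_0.w} + \underbrace{2\delta}_{\text{delay}}
     + \underbrace{\delta}_{L_0.r} = 4\delta
  \;\Longrightarrow\;
  \mathrm{inv}(L_0.w) > t_0 - 4\delta = s_0 - \Delta - 4\delta, \\
\mathrm{resp}(L.r')
  &\le \mathrm{inv}(L.r'') - (\Delta + 5\delta)
  < (s_0 + \delta) - (\Delta + 5\delta) = s_0 - \Delta - 4\delta .
\end{align*}
```

Together the two lines give $\mathrm{resp}(L.r') < \mathrm{inv}(L_0.w)$, which is exactly $L.r' \rightarrow L_0.w$.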

We are now ready to prove Safety. 

Lemma 2 (Safety). The implementation in Figure 2 satisfies Property 1.1. 

Proof. Let $L$ be a contend operation by process $p$ that returns at time $t$. Denote $s = t + \Delta$.
Suppose to the contrary that another contend operation $L'$ returns at time $t'$ within the
interval $[t, s]$.

First, suppose that $L'.r''$ has never been invoked. Then, $L'.r'$ must have returned $\bot$.
Therefore, $L'.r'$ must have been invoked before $L.w$ returns. Therefore, $L'.w$ returns before
$L$.delay($2\delta$) terminates. Hence, $L'.w \rightarrow L.r$, and by regularity of $x$, both $L.w$ and $L'.w$ must
appear in both $\pi_{L.r}$ and $\pi_{L'.r}$. Since $L.r$ returns the value written by $L.w$, $L'.w$ precedes $L.w$
in $\pi$. However, by assumption, $L'.r$ must return the value written by $L'.w$. Therefore, $L.w$
precedes $L'.w$ in $\pi$. A contradiction.

Next, suppose that $L'.r''$ was invoked. Then, it must have been invoked before $s + \delta$. By
Lemma 1, putting $L_0 = L$ we get that $L.w$ does not precede $L'.w$ in $\pi$. Second, $L.r''$ must be
invoked before $t'$, and a fortiori, before $t' + \Delta + \delta$. Applying Lemma 1 again, with $L_0 = L'$,
we get that $L'.w$ does not precede $L.w$ in $\pi$. A contradiction. □

We now turn our attention to proving Progress. We first prove the following technical fact. 

Lemma 3. Let $q$ be a process that performs an operation $w_1$ = write that returns at time
$t$. If no process returns from a contend operation after $t$, then for each $s > t$, the interval
$[s, s + 5\delta]$ contains a complete write invocation (i.e., from its invocation to its response).

Proof. Suppose to the contrary. By assumption, no write operation is invoked between $s$
and $s + 4\delta$. Let $W$ be the last write invoked before $s$, or possibly the set of concurrent, latest
writes invoked before $s$. Formally, $W$ is the set of all $w$ such that (1) $w$ is invoked before
$s$; and (2) for any write $w'$ invoked by $s + 4\delta$, $w \not\rightarrow w'$. $W$ is not empty because $w_1$ starts
before $s$, and no write is invoked in the interval $[s, s + 4\delta]$.

Let $w \in W$, and let $r$ = read be the corresponding read operation, invoked by the same
process $2\delta$ after $w$. We claim that (i) $W \rightarrow r$, and (ii) there does not exist any write event
$\omega$ in $\pi_r$ that follows $W$ in $\pi$ such that $W \rightarrow \omega$ and $\omega$ is invoked before $r$ returns.

To see that (i) holds, let $w' \in W$. Since $w \not\rightarrow w'$, we have that $w'$ terminates at most $\delta$
after $w$; since $r$ starts $2\delta$ after $w$'s termination, $w' \rightarrow r$. To see (ii), first note that if $W \rightarrow \omega$,
then by definition $\omega$ cannot be invoked before $s$. Second, by assumption, no write is invoked
between $s$ and $s + 4\delta$, but $r$ terminates by $s + 4\delta$ at the latest. So $\omega$ cannot be invoked before
$r$ returns, and hence is not in $\pi_r$.

Hence, by the regularity of $x$, all reads corresponding to writers in $W$ must return the
value of the last write in $\pi$ from $W$. The read corresponding to this write then sees $x$
unchanged, and its initiator is allowed to obtain the lease. A contradiction. □

Lemma 4 (Progress). The implementation in Figure 2 satisfies Property 1.2. 

Proof. Suppose that no process is holding the lease at time t. Let p be a correct process 
that is still contending at t. Suppose for contradiction that no contend operation returns 
after t. 

First, eventually some process, say q_1, invokes an operation w_1 = write. This is due to
the fact that the wait-loop at the start of the contend algorithm (lines 2.4-8) terminates
at some process when no writes are performed.

By Lemma 3, if there is no successful contend after w_1 returns, then every instance
of the loop by q_1 observes at least one new written value. Thus, the test in line 2 remains
false. Hence, q_1 does not perform any further writes. Let an operation w_2 = write by q_2 be
observed by q_1. Again, so long as there is no successful contend, by Lemma 3, q_2 performs
no further writes. And so on.

Since the number of processes is finite, eventually all processes are in their wait loop and 
no process writes. This is a contradiction. □ 

Finally, since the RELEASE code is trivially live, we proved the following:

Theorem 1. The implementation in Figure 2 satisfies Property 1.

Figure 4: Well-formed interaction of process i and the Δ-Lease object with renewals

6 Lease renewals 

In many situations, it is important to enable the current lease holder to renew its lease
without contention. For example, this is the case when a lease holder requires more time
to complete an operation than the allotted period. Another example is the use of leases to
obtain a leader, in which case we wish the leader to persist so long as it is alive.

In this and the following section, we consider lease renewals. We start by extending the 
lease specification in Section 4 to include lease renewals. 

The Δ-Lease object with renewals supports, for each process i, an additional renew_i
operation whose response is either true_i or false_i. The extended well-formedness condition
is given by the state diagram depicted in Figure 4. It allows an application in the Exit state
to attempt lease renewal by calling the renew operation. If the call to renew returns
true, the process assumes the lease for another Δ time units. Otherwise, it returns to the
state Free. Note that a process is allowed to renew its lease several times in a row before
relinquishing the lease with the RELEASE operation.

In addition to Property 1, a Δ-Lease object with renewals is required to satisfy the
following properties after time t > GST:

Property 2. At any point in an execution, the following holds:

1. Renewal Safety: If a correct process i is in the Renew state, then no other process is
in the Hold state.

2. Renewal Progress: If a correct process i is in the Renew state, then at some later
point process i enters the Hold state.



7 Implementing Renewals 

In this section we address the implementation of lease renewals. We consider two
implementation options. The first is suitable for the ND model and is extremely efficient. The
second works in the ◇ND model and guarantees stabilization of renewal: only one
renewal emerges successfully after GST, despite any unstable past periods, and despite the
possible existence of multiple simultaneous lease holders before GST. The ◇ND renewal
protocol is somewhat more costly.

7.1 ND renewal 

The renewal implementation in the ND model is extremely simple: a process whose
previously granted lease expires can renew it for another Δ time units by simply executing
lines 8-9 of the Δ-Lease implementation in Figure 2. More precisely, we define the renew
operation as follows:

RENEW:
    Generate a unique timestamp ts;
    write(x, ts);
    return true;
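The ND renewal above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the `Register` class is an in-memory stand-in for the shared register x, and building timestamps from a process id plus a local counter is an assumption borrowed from the timestamp format used later for the second renewal scheme.

```python
import itertools
import threading


class Register:
    """In-memory stand-in for the shared register x (hypothetical helper)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value


class NDLeaseClient:
    """Client-side state needed for ND renewal (illustrative names)."""

    def __init__(self, process_id, register):
        self.process_id = process_id
        self.register = register
        self._counter = itertools.count(1)

    def fresh_timestamp(self):
        # Assumed encoding: (process id, local counter) is unique per process.
        return (self.process_id, next(self._counter))

    def renew(self):
        # ND renewal: write a fresh timestamp and report success.
        ts = self.fresh_timestamp()
        self.register.write(ts)
        return True
```

Each call to `renew()` leaves a new timestamp in the register, which is what makes a concurrent contender's two reads differ.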

We now prove the correctness of the ND renewal scheme. Since liveness trivially holds, 
we are only left with proving safety. 

Lemma 5. Consider a sequence ℓ = L_0 rn_1 rn_2 ... rn_k of lease operations by process p.
Suppose that L_0 is a successful contend operation that returns at time t_0, and rn_k is a
successful renew operation that returns at time t_k. Then there exists no contend operation L
by a process q ≠ p such that L.w is invoked within the interval [t_0, t_k + Δ + 2δ].

Proof. By induction on the length of ℓ. For the base case, let ℓ = L_0 rn_1. Suppose to the
contrary that there exists a contend operation L such that L.w is invoked within
[t_0, t_1 + Δ + 2δ]. First, note that L_0.w → L.w, and therefore, L_0.w precedes L.w in π.
Therefore, by Lemma 1, L.r'' must be invoked after t_0 + Δ + δ. Since rn_1.w is invoked at
t_0 + Δ, it must return by t_0 + Δ + δ, and therefore, rn_1.w → L.r''. Since L.r'' is invoked
before t_1 + Δ + 2δ, L.r' returns before t_1 + Δ + 2δ − (Δ + 5δ) = t_1 − 3δ. Since rn_1.w is
invoked at t_1 − δ at the earliest, L.r' → rn_1.w. Therefore, by regularity of x and timestamp
uniqueness, L.r' and L.r'' return different values, violating the necessary condition for the
write statement of the contend implementation to be reached. Hence, L.w cannot be invoked.
A contradiction.

Assume that the result holds for all sequences ℓ of length k − 1, and consider a sequence
ℓ' = ℓ rn_k. Assume to the contrary. By the inductive assumption, L.w must be invoked
after t_(k−1) + Δ + 2δ. Therefore, rn_k.w → L.r''. On the other hand, L.r'' must be invoked
before t_k + Δ + 2δ. Therefore, L.r' must return before t_k − 3δ. Since the earliest time
rn_k.w can be invoked is t_k − δ, L.r' → rn_k.w. Therefore, by regularity of x and timestamp
uniqueness, L.r' and L.r'' return different values, violating the necessary condition for
the write statement of the contend implementation to be reached. Hence, L.w cannot be
invoked. A contradiction. □

Lemma 6. Suppose that a process p returns from a renew operation rn at time t. Then
there exists no process q ≠ p whose renew operation rn' returns within the interval [t, t + Δ].

Proof. Suppose to the contrary that rn' returns at time t' within the interval [t, t + Δ].
By well-formedness, both p and q must have invoked contend operations L and L'
in the past to acquire their initial leases. Suppose that L and L' return at times c < t
and c' < t' respectively. Assume, w.l.o.g., that c < c'. By Lemma 5, putting t_0 = c and
t_k = t + Δ, and because t' < t + Δ, we get that the lease period of L' overlaps with [t_0, t_k].
A contradiction. □

The following lemma follows immediately from Lemma 5 and Lemma 6. 

Lemma 7 (ND Renewal Safety). The ND renewal implementation satisfies Properties 1.1 
and 2.1. 

We proved the following: 

Theorem 2 (ND Renewal Correctness). The ND renewal implementation satisfies Prop- 
erties 1 and 2. 

7.2 ◇ND renewal

The renew operation implementation for the ◇ND model is shown in Figure 5. For
simplicity, we require that timestamps consist of two fields: the process id and a monotonically
increasing counter.

Throughout the proof of correctness of the ◇ND renewal scheme, we make use of the
following notation. Let L be a contend or renew operation. As in the previous section,
we denote the sequence of read/write operations by which L terminates by:

(in contend only: L.r', delay Δ + 6δ), L.r'', L.w, (delay 2δ), L.r.

That is, L.w is the last write operation invoked within L, and L.r'' and L.r are the read
operations immediately preceding and following L.w, respectively. If L is a contend operation,
and there exists a read operation invoked from line 7 of Figure 2, then L.r'' denotes the one
immediately preceding L.w. If L.r'' exists, it is immediately preceded by a read operation
L.r' from line 1 or line 12 of Figure 2 followed by a delay of (Δ + 5δ). Otherwise, let L.r' be
the last read operation during L from line 1 or line 12 of Figure 2 that precedes L.w.

Finally, for the execution considered in all proofs, let π be a serialization of the operations
that upholds the regularity of x.
RENEW:
(1)   x_1 ← read(x);
(2)   if (x_1.id ≠ ts.id) then
(3)       return false;
(4)   ts.counter ← ts.counter + 1;
(5)   write(x, ts);
(6)   delay(2δ);
(7)   x_1 ← read(x);
(8)   if (x_1 = ts) then
(9)       return true;
(10)  else
(11)      return false;

Figure 5: ◇ND Renew Implementation.
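As a minimal sketch of the steps in Figure 5, the following Python function runs the renewal against an in-memory stand-in for the register x. The `Register` class, the tuple encoding `(id, counter)` of timestamps, the handling of an empty register, and the injectable `sleep` parameter are assumptions made for illustration; a real deployment would read and write a regular register on shared storage.

```python
import threading
import time


class Register:
    """In-memory stand-in for the shared register x (hypothetical helper)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value


def ond_renew(x, ts, delta, sleep=time.sleep):
    """Attempt one renewal; returns (success, timestamp to keep).

    ts is the caller's current timestamp, encoded as (id, counter).
    """
    seen = x.read()                      # line 1
    if seen is None or seen[0] != ts[0]: # line 2: someone else wrote last
        return False, ts                 # line 3: counter left untouched
    new_ts = (ts[0], ts[1] + 1)          # line 4: bump the counter
    x.write(new_ts)                      # line 5
    sleep(2 * delta)                     # line 6: wait out in-flight writes
    seen = x.read()                      # line 7
    return seen == new_ts, new_ts        # lines 8-11
```

Passing a custom `sleep` makes it possible to simulate a competing write that lands during the 2δ wait, in which case the final read differs from the value just written and the renewal reports failure.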
Figure 6: Overlapping renewals.

Lemma 8. Let L_0 be a lease operation (contend or renew) invoked by process p that returns
successfully at time t_0. Denote by s_0 = t_0 + Δ the expiration time of L_0. Then there exists
no write operation w in π after L_0.w such that w is invoked before s_0 + δ.

Proof. Assume to the contrary, and let L.w be the first write in π that breaks the lemma.

Clearly, L.w does not precede L_0.r in π_{L_0.r}, for else L_0.r cannot return the value written
by L_0.w. Furthermore, since all write operations w such that w → L_0.r must appear in
π_{L_0.r} before L_0.r, and because by assumption L_0.w precedes L.w in π, L.w ↛ L_0.r. Putting
this together with the fact that the response of L_0.w and the start of L_0.r are separated by
a 2δ delay, we have L_0.w → L.r'' (see Figure 6). Hence, L_0.w ∈ π_{L.r''}.

Furthermore, by assumption L.w is the first write such that (1) L.w follows L_0.w in π;
and (2) L.w is invoked before s_0 + δ. Since L_0.w ∈ π_{L.r''}, any write w ≠ L.w that follows
L_0.w in π_{L.r''} must be invoked after s_0 + δ. Since, by definition of π_{L.r''}, w must be
invoked before L.r'' terminates, L.r'' terminates after s_0 + δ. Consequently, L.w would be
invoked after s_0 + δ, contradicting the assumption. Since L.w ∉ π_{L.r''}, the only remaining
possibility is that L_0.w is the last write in π_{L.r''}, and so L.r'' returns the value of L_0.w.

Next, we consider the case that L is a contend operation separately from the case that
it is a renew operation. First, consider that L is a renew operation. Then the analysis
above shows that L.r'' returns the timestamp written in L_0.w, hence L is unsuccessful.

Second, assume that L is a contend operation. Here, L.r'' is preceded by a 6δ + Δ
delay preceded by another read operation L.r', and the timestamp values returned by these
two reads are identical. However, it is easy to see that L_0.w is contained in full between
these two reads. We already know that L_0.w → L.r''. We now show that L.r' → L_0.w.
Indeed, the earliest time that L_0.w can be invoked is s_0 − Δ − 4δ. Since by assumption L.w
is invoked before s_0 + δ, L.r' is invoked before s_0 + δ − (Δ + 6δ) = s_0 − Δ − 5δ. Therefore,
L.r' → L_0.w. Thus, regularity of x and the timestamp uniqueness imply that L.r' and L.r''
return different timestamps, in which case the lease implementation implies that the write
statement is not reached. Hence, L.w could not have been invoked. A contradiction. □
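The timing step in the contend case can be spelled out as a chain of inequalities. The sketch below assumes, as in the earlier proofs, that every read and write completes within δ of its invocation; inv(·) and res(·) are shorthand introduced here for invocation and response times and are not notation from the text.

```latex
\begin{align*}
\mathrm{inv}(L.w) &< s_0 + \delta
  && \text{(assumption on } L.w\text{)} \\
\mathrm{inv}(L.r') &< s_0 + \delta - (\Delta + 6\delta) = s_0 - \Delta - 5\delta
  && \text{(delay between } L.r' \text{ and } L.r''\text{)} \\
\mathrm{res}(L.r') &\le \mathrm{inv}(L.r') + \delta < s_0 - \Delta - 4\delta
  && \text{(operations take at most } \delta\text{)} \\
\mathrm{inv}(L_0.w) &\ge s_0 - \Delta - 4\delta
  && \text{(earliest invocation of } L_0.w\text{)}
\end{align*}
```

Combining the last two lines gives res(L.r') < inv(L_0.w), which is the precedence L.r' → L_0.w used in the proof.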

We are now ready to prove Safety: 

Lemma 9. Assume that a lease operation L (contend or renew) by process p returns
successfully at time t. Let s = t + Δ. Then there exists no successful contend or renew
operation L' by a process q ≠ p that returns during the interval [t, s].

Proof. Suppose to the contrary that L' returns successfully at time t' within the interval
[t, s]. First, L'.w must be invoked before s + δ. By Lemma 8, putting L_0 = L, we get that
L.w does not precede L'.w in π. Second, L.w must be invoked before t', and a fortiori, before
t' + Δ + δ. Applying Lemma 8 again, with L_0 = L', we get that L'.w does not precede L.w
in π. A contradiction. □

Lemma 10. Assume that a renew operation L by a process p is invoked at time t_1 and
returns successfully at time t_2. Then there exists no successful contend or renew operation
L' by a process q ≠ p that returns during the interval [t_1, t_2].

Proof. Suppose to the contrary that L' returns at a time t' within the interval [t_1, t_2]. First,
L'.w must be invoked before s + δ, where s = t_2 + Δ. By Lemma 8, putting L_0 = L, we get
that L'.w must precede L.w in π. Furthermore, applying Lemma 8 again with L_0 = L', we get
that L.w must be invoked after t' + Δ + δ. Therefore, L'.w → L.r'', so that L'.w ∈ π_{L.r''},
and L'.w precedes L.r'' in π_{L.r''}.

First, suppose that L.w is the first write operation by p in π after L'.w. Hence, there is
no write operation by p in π_{L.r''} following L'.w. Then by regularity of x, and because L is
a renew operation, L.r'' returns a timestamp written by a process q ≠ p, contradicting
the fact that L is successful.

Next, suppose that there exists a write operation L''.w by p in π_{L.r''} that follows L'.w.
Since L is a renew operation, L'' must be the successful lease (renew or contend)
operation immediately preceding L. Applying Lemma 8 with L_0 = L', we get that L''.w must
be invoked after t' + Δ + δ, implying that L starts after t' + Δ + δ (i.e., t_1 > t' + Δ + δ).
This contradicts the assumption that t' ≥ t_1. □

We proved the following:

Theorem 3 (Renewal Safety). The ◇ND renew implementation satisfies Properties 1.1 and 2.1.

Finally, we prove Liveness:

Lemma 11. Assume that a correct process p obtains the lease in a contend or renew
operation L at time t. Then a renew operation rn invoked by p at s = t + Δ returns
successfully.

Proof. For rn to be successful, first rn.r'' must return the timestamp written by L.w. This
holds by the fact that L.r returns the value of L.w, and by Lemma 8, since no other write
operation that follows L.w in π is invoked before s + δ.

Second, rn.r needs to return the value written by rn.w. Suppose to the contrary that
some lease operation L' overwrites rn.w. Let L'.w be the first write in π by a process q ≠ p
that follows L.w and precedes rn.r in π_{rn.r}.

By Lemma 8, L'.w is invoked after s + δ. Hence, L.w → L'.r''. Since L'.w is the first write
to follow L.w, and since L'.r'' → L'.w, we have that L'.r'' returns the timestamp written
by p in L.w. By construction, this occurs only if L' is a contend (not renew) operation.
Still, for L'.w to be invoked, L'.r' and L'.r'' must return the same timestamp. We now show
this is impossible.

We already know that L.w → L'.r''. By construction, L'.r'' follows a delay of Δ + 6δ
after the termination of L'.r'. If L'.r'' is invoked no later than s + 2δ, then L'.r' terminates
by s − Δ − 4δ. Since the earliest that L.w is invoked is t − 4δ, we have L'.r' → L.w. We get
that L.w is a write that occurs completely between L'.r' and L'.r'', and so they must return
different timestamps.

We are left with the possibility that L'.r'' is invoked after s + 2δ. Because L'.w precedes
rn.r in π_{rn.r}, the latest that L'.r'' may be invoked is s + 5δ. Hence, L'.r' terminates by
s − δ. We now get that rn.w is a write that occurs completely between L'.r' and L'.r'', and so
they return different timestamps.

Hence, L'.r' and L'.r'' must see different values, in contradiction to the assumption that
L'.w is invoked after L'.r''. Hence, rn.r returns the same value as rn.w, and the renewal
succeeds. □

We proved the following:

Theorem 4 (◇ND Renewal Correctness). The ◇ND renewal implementation satisfies
Properties 1 and 2.

8 Leader Election 

In this section we show the lease based implementation of the Boolean failure detector oracle,
denoted ℒ, that is required by the Consensus algorithms of [19, 14]. ℒ is defined as follows:
Let ℒ_i denote the local instance of ℒ at a process p_i, with a boolean isLeader() operation
returning the current value output by ℒ_i. Then, ℒ is required to satisfy the following property
eventually:

Property 3 (Unique Leader). There exists a correct process p_i such that every invocation
of ℒ_i.isLeader() returns true, and for each process p_j ≠ p_i, every invocation of
ℒ_j.isLeader() returns false.

The lease based implementation of ℒ appears in Figure 7. A complete Consensus
algorithm based on ℒ appears in [14]. Here, we include it in Appendix A for completeness.



Shared Δ-Lease object L;
Local Boolean leader;

(1)  forever do
(2)      leader ← false;
(3)      L.contend();
(4)      leader ← true;
(5)      delay(Δ);
(6)      while (L.renew()) do
(7)          delay(Δ);
(8)  od;

isLeader:
     return leader;

Figure 7: The Lease-based Leader Oracle implementation
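The loop in Figure 7 might be rendered in Python roughly as follows. This is a sketch under stated assumptions: the `lease` argument is any object exposing `contend()` and `renew()` methods (hypothetical names for the two lease operations), the `stop()` method is added only so the loop can be terminated in a demonstration, and `sleep` is injectable for the same reason.

```python
import threading
import time


class LeaderOracle:
    """Leader oracle driven by a lease object (illustrative sketch)."""

    def __init__(self, lease, lease_period, sleep=time.sleep):
        self.lease = lease                # assumed to offer contend() / renew()
        self.lease_period = lease_period  # plays the role of the lease length
        self._sleep = sleep
        self._leader = False
        self._stop = threading.Event()

    def is_leader(self):
        # Corresponds to isLeader in Figure 7.
        return self._leader

    def stop(self):
        # Not part of Figure 7; lets a demo end the otherwise endless loop.
        self._stop.set()

    def run(self):
        while not self._stop.is_set():          # line 1: forever do
            self._leader = False                # line 2
            self.lease.contend()                # line 3: blocks until granted
            self._leader = True                 # line 4
            self._sleep(self.lease_period)      # line 5
            while not self._stop.is_set() and self.lease.renew():  # line 6
                self._sleep(self.lease_period)  # line 7
        self._leader = False
```

In use, `run()` would typically execute in a background thread while the Consensus code polls `is_leader()`.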

The following theorem establishes the correctness of the leader oracle implementation in
the ◇ND model.

Theorem 5. The pseudocode in Figure 7 eventually satisfies Property 3 in the ◇ND model.

Proof. Let T > GST be a time such that all the leases acquired before GST have expired
and all the faulty processes have crashed by T. Let Leaders_T be the set of processes that are
still leaders after T. If Leaders_T ≠ ∅, then all the processes in Leaders_T must be executing
lines 6-7 of the code in Figure 7. By the renewal liveness, one of the processes renewing its
lease at line 6 at some time t > T will succeed in renewing its lease at each renewal attempted
after t. By the renewal safety, from time t on, this process will remain the exclusive
lease holder.

If Leaders_T = ∅, then by the lease liveness, for some process p invoking L.contend()
after GST, L.contend() will return at some time t > T. By the renewal liveness, p will succeed
in renewing its lease at each renewal attempted after t. By the renewal safety, from
time t on, p will remain the exclusive lease holder. □

9 Preliminary Performance Assessment 

To assess the scalability of the lease implementation, we carried out preliminary simulation 
studies. The simulation results appear in Figures 8 and 9. 

In our experiments, we assumed that read and write operations take times exponentially
distributed with mean 1. Accordingly, the lease delays were measured in units of the
mean read/write delay. In all the experiments, δ was set to 2, and Δ was set to 1. The choice
of δ = 2 is justified by both the exponential distribution properties and the simulation
studies. The experiments vary the number n of contending processes. All contending processes
start simultaneously, and contend for the lease once until they obtain it. Subsequently, they
release it after a delay of Δ = 1. The graph in Figure 8 shows the average delay until the first
process obtains the lease as a function of the number of simultaneously contending processes;
and the graph in Figure 9 shows the average delay until all the contending processes
succeed in obtaining their leases. The first graph fits a Θ(ln(n)) curve and the second one
fits a Θ(n + ln(n)) curve. These results suggest good scalability for a real
implementation and are also consistent with the exponential distribution analysis of [20].

Figure 8: Delay until the first client gets the lease (simulation data with a logarithmic fit)

Figure 9: Delay until all the clients get the lease

Both analytical and empirical performance evaluation of the lease algorithms, as well as
their implementation in a real storage system, are the subject of ongoing work.
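One plausible source of a logarithmic term under exponentially distributed operation times is the time taken by the slowest of n concurrent operations, whose mean is the harmonic number H_n ≈ ln(n) + 0.577. The Python sketch below checks that standard fact numerically. It is not the simulation used for Figures 8 and 9, and the function names are chosen here for illustration only.

```python
import math
import random


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def mean_slowest_of_n(n, trials, rng):
    """Monte Carlo estimate of E[max of n i.i.d. Exp(1) durations]."""
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 10, 100):
        estimate = mean_slowest_of_n(n, 5000, rng)
        print(n, round(estimate, 2), round(harmonic(n), 2), round(math.log(n), 2))
```

The estimate tracks H_n, which grows like ln(n); whether this is the dominant effect in the reported curves would need the full contend simulation to confirm.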

10 Practical considerations 

There are a number of considerations worth noting in the context of practical distributed
storage systems. First, a standard concurrency policy is to allow either multiple simultaneous
readers, or one exclusive writer. Our leases easily support this paradigm. More specifically,
in our scheme, access is granted to contending processes by writing their names onto a shared
read/write register. Therefore, multiple readers can be supported simply by having readers
use a common name (e.g., "reader"), and writers use their own identity.
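The naming convention for readers and writers can be illustrated with a small Python helper. The function name, the mode strings, and the use of plain strings as names are assumptions made for this sketch.

```python
SHARED_READER_NAME = "reader"  # common name used by every reader


def lease_name(client_id, mode):
    """Return the name a client writes to the register when contending."""
    if mode == "read":
        # All readers contend under the same name, so they are
        # indistinguishable to one another and can share the lease.
        return SHARED_READER_NAME
    if mode == "write":
        # A writer contends under its own identity and excludes everyone else.
        return client_id
    raise ValueError(f"unknown mode: {mode!r}")
```

Two readers produce the same name, while any two distinct writers, or a reader and a writer with a different identity, produce different names.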

Another important concern is caching. In a scalable system, a client obtaining a lease 
on a file may hold the file for some period of time, and work on a local cached copy of the 
file. However, the lease for the file has to be renewed periodically, which in our approach, 
implies writing to disk. The obvious concern is that lease-renewal could subvert the benefits 
of caching. 

We expect this not to be the case for several reasons. First, comparing our storage-centric
lock-renewal with the standard lease-manager approach, it is disputable that writing to a
disk over a modern SAN is less efficient than sending a message to the lease manager.
An advanced storage controller (like IBM's Shark or Total Storage Volume Controller [21])
provides sophisticated caching which is also fault-tolerant, so writing to a disk can be
as fast as writing to a process. Moreover, measurements performed in [6] indicate that in
scalable settings, the costs of accessing a remote disk are significantly outweighed by the
overhead of going through a bottleneck lease manager. Further assessing the cost tradeoffs
of our approach under different conditions is a topic of further study.

Additionally, the performance gain of caching should always be weighed against the
end-user guarantees. Suppose that a client holding cached data is falsely suspected, and the
lease is granted to another client. Then, when the original client eventually attempts to
write the cached data back to disk, its write would be aborted to prevent inconsistency.
Subsequently, all the modifications issued by the end-user will be lost. In order to provide a
reasonable level of end-user semantics, the cached copy must be synchronized with the disk
copy frequently enough. Thus, the lease renewal can be piggybacked on these synchronization
messages.
A Uniform Consensus based on ℒ

Our Consensus implementation utilizes the ranked register primitive of [14], defined as follows:
Let Ranks be a totally ordered set of ranks with a distinguished initial rank r_0 such that for
each r ∈ Ranks, r > r_0; and let Vals be a set of values with a distinguished initial value v_0. We
also consider the set of pairs denoted RVals, which is Ranks × Vals with selectors rank and
value. A ranked register is a multi-reader, multi-writer shared memory register with two
operations: rr-read(r)_i by process i, r ∈ Ranks, whose corresponding response is value(V)_i,
where V ∈ RVals; and rr-write(V)_i by process i, V ∈ RVals, whose reply is either commit_i
or abort_i.

Definition 2. We say that a rr-read operation R = rr-read(r_2)_i sees a rr-write operation
W = rr-write(⟨r_1, v⟩)_j if R returns ⟨r', v'⟩ where r' ≥ r_1.

The ranked register is required to satisfy the following three properties: 

Property 4 (Safety). Every rr-read operation returns a value and rank that was written in
some rr-write invocation. Additionally, let W = rr-write(⟨r_1, v⟩)_i be a rr-write operation that
commits, and let R = rr-read(r_2)_j, such that r_2 > r_1. Then R sees W.

Property 5 (Non-Triviality). If a rr-write operation W invoked with the rank r_1 aborts,
then there exists a rr-read (rr-write) operation with rank r_2 > r_1 which is invoked before W
returns.

Property 6 (Liveness). If an operation (rr-read or rr-write) is invoked by a non-faulty
process, then it eventually returns.
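To make the interface concrete, here is a minimal single-copy, sequential sketch of an object with this interface in Python. It is an illustration only, not the construction of [14]: integer ranks with 0 as the initial rank, `None` as the initial value, and the field names are all assumptions, and the sketch ignores concurrency and failures.

```python
from dataclasses import dataclass

BOTTOM = None  # stands for the distinguished initial value


@dataclass
class RankedRegister:
    read_rank: int = 0       # highest rank passed to rr_read so far
    write_rank: int = 0      # rank of the last committed rr_write
    value: object = BOTTOM   # value of the last committed rr_write

    def rr_read(self, rank):
        # Remember the rank so that lower-ranked writes abort later.
        self.read_rank = max(self.read_rank, rank)
        return (self.write_rank, self.value)

    def rr_write(self, rank, value):
        # Abort only if a higher-ranked read or write has already been seen.
        if rank < max(self.read_rank, self.write_rank):
            return "abort"
        self.write_rank = rank
        self.value = value
        return "commit"
```

In this sequential setting a committed write with rank r_1 is seen by every later read, and a write aborts only after an operation with a strictly higher rank, which mirrors the Safety and Non-Triviality properties above.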

The pseudocode of the Consensus implementation is shown in Figure 10. Please refer 
to [14] for the correctness proof. 
Shared: Ranked registers rr, initialized by rr-write(⟨r_0, ⊥⟩)
            which commits;
        Regular register decision, with values in RVals,
            initialized by write(⟨r_0, ⊥⟩)
Local:  V ∈ RVals ∪ {abort},
        r ∈ Ranks;

Process i:

propose(v), Vals → Vals
    r ← r_0;
    while (true) do
        V ← decision.read();
        if (V.value ≠ ⊥)
            return V.value;
        if (ℒ_i.isLeader()) then
            r ← chooseRank(r);
            V ← DECIDE(⟨r, v⟩);
            if (V ≠ abort)
                return V.value;
        fi
    od

Function DECIDE(⟨r, v⟩), RVals → RVals ∪ {abort}:
    V ← rr.rr-read(r)_i;
    if (V.value = ⊥) then
        V.value ← v;
    V.rank ← r;
    if (rr.rr-write(V)_i = commit) then
        decision.write(V);
        return V;
    fi
    return abort;

Figure 10: Consensus using a ranked register and ℒ
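The control flow of Figure 10 can be sketched in Python as below. Everything concrete here is an assumption made for illustration: the in-memory `RankedRegister`, integer ranks, the `choose_rank` rule (the next rank above r that is congruent to the process id modulo n), the `max_rounds` bound that replaces the unbounded loop, and the `is_leader` callback standing in for the leader oracle.

```python
BOTTOM = None  # stands for the initial value


class RankedRegister:
    """Sequential in-memory ranked register (illustrative assumption)."""

    def __init__(self):
        self.read_rank, self.write_rank, self.value = 0, 0, BOTTOM

    def rr_read(self, rank):
        self.read_rank = max(self.read_rank, rank)
        return (self.write_rank, self.value)

    def rr_write(self, rank, value):
        if rank < max(self.read_rank, self.write_rank):
            return "abort"
        self.write_rank, self.value = rank, value
        return "commit"


class Shared:
    """Objects shared by all processes: rr and the decision register."""

    def __init__(self):
        self.rr = RankedRegister()
        self.decision = (0, BOTTOM)


class Process:
    def __init__(self, pid, n, shared, is_leader):
        self.pid, self.n, self.shared, self.is_leader = pid, n, shared, is_leader

    def choose_rank(self, r):
        # Hypothetical rule: smallest rank above r owned by this process.
        return (r // self.n + 1) * self.n + self.pid

    def decide(self, r, v):
        _, seen = self.shared.rr.rr_read(r)
        value = v if seen is BOTTOM else seen   # adopt a value already written
        if self.shared.rr.rr_write(r, value) == "commit":
            self.shared.decision = (r, value)
            return (r, value)
        return "abort"

    def propose(self, v, max_rounds=100):
        r = 0
        for _ in range(max_rounds):             # bounded stand-in for while(true)
            _, decided = self.shared.decision
            if decided is not BOTTOM:
                return decided
            if self.is_leader():
                r = self.choose_rank(r)
                outcome = self.decide(r, v)
                if outcome != "abort":
                    return outcome[1]
        return None                             # no decision within the bound
```

A process whose oracle never reports leadership simply keeps polling the decision register, so it learns the value once some leader has committed it.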